LendingClub is a peer-to-peer lending company that directly connects borrowers with potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.
In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or will be charged off and possibly go into default. In this assignment you will:
Let's get started!
Make sure you have the latest version of GraphLab Create. If you can't find the decision tree module, you will need to upgrade GraphLab Create using
pip install graphlab-create --upgrade
import graphlab as gl
print('gl.version: %s' % (gl.version))
gl.canvas.set_target('ipynb')
import math
import string
# my imports
import pandas as pd
import numpy as np
from types import MethodType
def value_counts(self):
    # Convert the SFrame to a pandas DataFrame and print the
    # value counts of every column (pandas is imported above).
    pdDf = self.to_dataframe()
    for ftr in pdDf.columns:
        print(pdDf[ftr].value_counts())
#SFrame.value_counts = MethodType(value_counts, None, SFrame)
#setattr(SFrame, 'value_counts', value_counts)
#setattr(glbObsAll, 'value_counts', value_counts)
We will be using a dataset from the LendingClub. A parsed and cleaned form of the dataset is available here. Make sure you download the dataset before running the following command.
glbObsAll = gl.SFrame('data/lending-club-data.gl/')
print(glbObsAll.shape)
glbObsAll.show()
print(glbObsAll)
Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.
glbObsAll.column_names()
Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.
glbObsAll['grade'].show()
We can see that over half of the loan grades are assigned values B or C. Each loan is assigned one of these grades, along with a more finely discretized feature called subgrade (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found here.
Now, let's look at a different feature.
glbObsAll['home_ownership'].show()
This feature describes whether the borrower has a mortgage, is renting, or owns a home. We can see that only a small percentage of the borrowers own a home.
The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.
In order to make this more intuitive and consistent with the lectures, we reassign the target to be +1 for a safe loan and -1 for a risky (bad) loan.
We put this in a new column called safe_loan.
print(glbObsAll['bad_loans'])
# safe_loan = 1 => safe
# safe_loan = -1 => risky
#glbObsAll['bad_loans'][:5].apply(lambda x : +1 if x == 0 else -1)
glbObsAll['safe_loan'] = glbObsAll['bad_loans'].apply(lambda x : +1 if x==0 else -1)
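The SFrame .apply() call above can be sketched in plain Python (on a few example values, not the real data) to make the remapping explicit:

```python
# Plain-Python sketch of the remapping above:
# 0 (not a bad loan) -> +1 (safe), 1 (bad loan) -> -1 (risky).
bad_loans = [0, 1, 0, 0, 1]                        # example values only
safe_loan = [+1 if x == 0 else -1 for x in bad_loans]
print(safe_loan)                                   # [1, -1, 1, 1, -1]
```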
#glbObsAll = glbObsAll.remove_column('bad_loans')
Now, let us explore the distribution of the column safe_loan. This gives us a sense of how many safe and risky loans are present in the dataset.
glbObsAll['safe_loan'].show(view = 'Categorical')
You should have:
It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.
features = ['grade', # grade of the loan
'sub_grade', # sub-grade of the loan
'short_emp', # one year or less of employment
'emp_length_num', # number of years of employment
'home_ownership', # home_ownership status: own, mortgage or rent
'dti', # debt to income ratio
'purpose', # the purpose of the loan
'term', # the term of the loan
'last_delinq_none', # has borrower had a delinquency
'last_major_derog_none', # has borrower had 90 day or worse rating
'revol_util', # percent of available credit being used
'total_rec_late_fee', # total late fees received to date
]
target = 'safe_loan' # prediction target (y) (+1 means safe, -1 is risky)
# Extract the feature columns and target column
glbObsAll = glbObsAll[features + [target]]
print(glbObsAll.shape)
What remains now is a subset of features and the target that we will use for the rest of this notebook.
As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (glbObsSfe) and one with just the risky loans (glbObsRsk).
glbObsSfe = glbObsAll[glbObsAll[target] == +1]
glbObsRsk = glbObsAll[glbObsAll[target] == -1]
print("Number of safe loans  : %s" % len(glbObsSfe))
print("Number of risky loans : %s" % len(glbObsRsk))
Now, write some code to compute below the percentage of safe and risky loans in the dataset, and validate these numbers against what was given using .show earlier in the assignment:
print("Percentage of safe loans: %.2f%%" % (glbObsSfe.shape[0] * 100.0 / glbObsAll.shape[0]))
print("Percentage of risky loans: %.2f%%" % (glbObsRsk.shape[0] * 100.0 / glbObsAll.shape[0]))
One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(glbObsRsk)/float(len(glbObsSfe))
#risky_glbObsAll = glbObsRsk
glbObsSfeSmp = glbObsSfe.sample(percentage, seed=1)
# Append the risky loans with the downsampled version of safe loans
glbObsSmp = glbObsRsk.append(glbObsSfeSmp)
print(glbObsSmp.shape)
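The undersampling idea above can be sketched without GraphLab using synthetic labels (the counts and ratio here are illustrative, not the real data):

```python
import random

# Sketch of undersampling the majority class with synthetic labels:
# 80% safe (+1) vs. 20% risky (-1), mirroring the class imbalance.
random.seed(1)
labels = [+1] * 800 + [-1] * 200
safe  = [x for x in labels if x == +1]
risky = [x for x in labels if x == -1]

# Keep each safe loan with probability equal to the risky/safe ratio,
# analogous to glbObsSfe.sample(percentage, seed=1).
ratio = len(risky) / float(len(safe))
safe_sampled = [x for x in safe if random.random() < ratio]

# Append the risky loans to the downsampled safe loans.
balanced = risky + safe_sampled
print(len(risky), len(safe_sampled))   # roughly equal class counts
```

Because the sampling is probabilistic, the two classes end up approximately, not exactly, balanced.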
Now, let's verify that the resulting percentages of safe and risky loans are each nearly 50%.
print("Percentage of safe loans: %.2f%%" %
(glbObsSmp[glbObsSmp[target] == +1].shape[0] * 100.0 / glbObsSmp.shape[0]))
print("Percentage of risky loans: %.2f%%" %
(glbObsSmp[glbObsSmp[target] == -1].shape[0] * 100.0 / glbObsSmp.shape[0]))
# print "Percentage of safe glbObsAll :", len(safe_loan) / float(len(glbObsSmp))
# print "Percentage of risky glbObsAll :", len(risky_glbObsAll) / float(len(glbObsSmp))
#print "Total number of glbObsAll in our new dataset :", len(glbObsSmp)
Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
We split the data into training and validation sets using an 80/20 split and specifying seed=1 so everyone gets the same results.
Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.
glbObsFit, glbObsOOB = glbObsSmp.random_split(.8, seed=1)
print(glbObsFit.shape)
print(glbObsOOB.shape)
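The behavior of a probabilistic 80/20 split like random_split can be sketched in plain Python (synthetic rows, illustrative only):

```python
import random

# Sketch of an 80/20 random split: each row independently lands in
# the training set with probability 0.8, else in the validation set.
random.seed(1)
rows = list(range(1000))          # stand-in for the sampled dataset
train, valid = [], []
for r in rows:
    (train if random.random() < 0.8 else valid).append(r)
print(len(train), len(valid))     # roughly 800 / 200
```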
Now, let's use the built-in GraphLab Create decision tree learner to create a loan prediction model on the training data. (In the next assignment, you will implement your own decision tree learning algorithm.) Our feature columns and target column have already been decided above. Use validation_set=None to get the same results as everyone else.
dtrMdl = gl.decision_tree_classifier.create(glbObsFit, validation_set=None,
target = target, features = features)
dtrMdl.show(view="Tree")
As noted in the documentation, the max depth of the tree is typically capped at 6. However, such a tree can be hard to visualize graphically. Here, we instead learn a smaller model with max depth of 2 to gain some intuition by visualizing the learned tree.
dpth2Mdl = gl.decision_tree_classifier.create(glbObsFit, validation_set=None,
target = target, features = features, max_depth = 2)
In the view that is provided by GraphLab Create, you can see each node, and each split at each node. This visualization is great for considering what happens when this model predicts the target of a new data point.
Note: To better understand this visual:
dpth2Mdl.show(view="Tree")
Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:
glbObsOOBSfe = glbObsOOB[glbObsOOB[target] == +1]
glbObsOOBRsk = glbObsOOB[glbObsOOB[target] == -1]
smpObsOOBRsk = glbObsOOBRsk[0:2]
smpObsOOBSfe = glbObsOOBSfe[0:2]
smpObsOOB = smpObsOOBSfe.append(smpObsOOBRsk)
smpObsOOB
Now, we will use our model to predict whether or not a loan is likely to default. For each row in the smpObsOOB, use the dtrMdl to predict whether or not the loan is classified as a safe loan.
Hint: Be sure to use the .predict() method.
tgtP = target + '.P'
smpObsOOB[tgtP] = dtrMdl.predict(smpObsOOB)
print(smpObsOOB)
Quiz Question: What percentage of the predictions on smpObsOOB did dtrMdl get correct?
For each row in the smpObsOOB, what is the probability (according to dtrMdl) of a loan being classified as safe?
Hint: Set output_type='probability' to make probability predictions using dtrMdl on smpObsOOB:
tgtPPrbby = target + '.PPrbby'
smpObsOOB[tgtPPrbby] = dtrMdl.predict(smpObsOOB, output_type='probability')
print(smpObsOOB)
Quiz Question: Which loan has the highest probability of being classified as a safe loan?
Checkpoint: Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?
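The checkpoint above can be sketched as a simple threshold rule (example probabilities, not the model's actual output):

```python
# A probability prediction >= 0.5 should correspond to a class
# prediction of +1 (safe), and < 0.5 to -1 (risky).
probs = [0.65, 0.48, 0.72, 0.30]   # example probabilities only
preds = [+1 if p >= 0.5 else -1 for p in probs]
assert all((p >= 0.5) == (c == +1) for p, c in zip(probs, preds))
print(preds)                        # [1, -1, 1, -1]
```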
Now, we will explore something pretty interesting. For each row in the smpObsOOB, what is the probability (according to dpth2Mdl) of a loan being classified as safe?
Hint: Set output_type='probability' to make probability predictions using dpth2Mdl on smpObsOOB:
smpObsOOB[target + '.dpth2P.Prbby'] = dpth2Mdl.predict(smpObsOOB, output_type='probability')
print(smpObsOOB)
Quiz Question: Notice that the probability predictions are exactly the same for the 2nd and 3rd loans, i.e. 0.472267584643798. Why would this happen?
Note that you should be able to look at the small tree, traverse it yourself, and visualize the prediction being made. Consider the following point in smpObsOOB:
smpObsOOB[1]
Let's visualize the small tree here to do the traversing for this data point.
dpth2Mdl.show(view="Tree")
Note: In the tree visualization above, the values at the leaf nodes are not class predictions but scores (a slightly advanced concept that is out of the scope of this course). You can read more about this here. If the score is $\geq$ 0, the class +1 is predicted. Otherwise, if the score < 0, we predict class -1.
Quiz Question: Based on the visualized tree, what prediction would you make for this data point?
Now, let's verify your prediction by examining the prediction made using GraphLab Create. Use the .predict function on dpth2Mdl.
Recall that the accuracy is defined as follows: $$ \mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}} $$
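The formula above translates directly into a small helper (using made-up labels for illustration; the real evaluation below uses the model's .evaluate() method):

```python
# accuracy = (# correctly classified examples) / (# total examples)
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

print(accuracy([+1, -1, +1, +1], [+1, -1, -1, +1]))   # 0.75
```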
Let us start by evaluating the accuracy of the dpth2Mdl and dtrMdl on the training data
print(dpth2Mdl.evaluate(glbObsFit)['accuracy'])
print(dtrMdl.evaluate(glbObsFit)['accuracy'])
Checkpoint: You should see that the dpth2Mdl performs worse than the dtrMdl on the training data.
Now, let us evaluate the accuracy of the dpth2Mdl and dtrMdl on the entire glbObsOOB, not just the subsample considered above.
print(dpth2Mdl.evaluate(glbObsOOB)['accuracy'])
print(dtrMdl.evaluate(glbObsOOB)['accuracy'])
Quiz Question: What is the accuracy of dtrMdl on the validation set, rounded to the nearest .01?
Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.
dpth10Mdl = gl.decision_tree_classifier.create(glbObsFit, validation_set=None,
target = target, features = features, max_depth = 10)
dpth10Mdl.show(view="Tree", vlabel_hover = True)
Now, let us evaluate dpth10Mdl on the training set and validation set.
print(dpth10Mdl.evaluate(glbObsFit)['accuracy'])
print(dpth10Mdl.evaluate(glbObsOOB)['accuracy'])
Checkpoint: We should see that dpth10Mdl has even better performance on the training set than dtrMdl did on the training set.
Quiz Question: How does the performance of dpth10Mdl on the validation set compare to dtrMdl on the validation set? Is this a sign of overfitting?
Every mistake the model makes costs money. In this section, we will try to quantify the cost of each mistake made by the model.
Assume the following:
Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:
First, let us make predictions on glbObsOOB using the dtrMdl:
#predictions = dtrMdl.predict(glbObsOOB)
glbObsOOB[target + '.dtrMdl.P'] = dtrMdl.predict(glbObsOOB)
print(glbObsOOB)
False positives are predictions where the model predicts +1 but the true label is -1. Complete the following code block for the number of false positives:
glbObsOOB[target + '.dtrMdl.P.FP'] = ((glbObsOOB[target + '.dtrMdl.P'] == +1) &
(glbObsOOB[target ] == -1))
glbObsOOB[target + '.dtrMdl.P.FP'].show(view = 'Categorical')
print(value_counts(glbObsOOB[[target + '.dtrMdl.P.FP']]))
print(glbObsOOB)
False negatives are predictions where the model predicts -1 but the true label is +1. Complete the following code block for the number of false negatives:
glbObsOOB[target + '.dtrMdl.P.FN'] = ((glbObsOOB[target + '.dtrMdl.P'] == -1) &
(glbObsOOB[target ] == +1))
glbObsOOB[target + '.dtrMdl.P.FN'].show(view = 'Categorical')
print(value_counts(glbObsOOB[[target + '.dtrMdl.P.FN']]))
print(glbObsOOB)
print(glbObsOOB[glbObsOOB[target + '.dtrMdl.P.FN'] == 1])
Quiz Question: Let us assume that each mistake costs money: a false negative costs $10,000 and a false positive costs $20,000 (these costs match the computation below).
What is the total cost of mistakes made by dtrMdl on glbObsOOB?
print(10000 * glbObsOOB[glbObsOOB[target + '.dtrMdl.P.FN'] == 1].shape[0] +
20000 * glbObsOOB[glbObsOOB[target + '.dtrMdl.P.FP'] == 1].shape[0] +
0)
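The same cost computation can be sketched in plain Python on synthetic labels (assumed costs: $10,000 per false negative, $20,000 per false positive, as in the code above):

```python
# Synthetic true labels and predictions, for illustration only.
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, +1, +1]

# False negative: predicted -1 but the true label is +1.
fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
# False positive: predicted +1 but the true label is -1.
fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)

cost = 10000 * fn + 20000 * fp
print(fn, fp, cost)   # 1 1 30000
```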